\[ \begin{equation} \nonumber \hat{Y}= X\beta \end{equation} \]
The OLS estimators solve:
\[ \begin{equation} \nonumber \hat{\beta}^{OLS} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^n (y_i - X_i\beta)^2 \end{equation} \]
Shrinkage methods solve:
\[ \begin{equation} \nonumber \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^n (y_i - X_i\beta)^2 + \lambda \left(\alpha\sum_{j=1}^p|\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^p\beta_j^2\right) \end{equation} \]
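To make the penalty concrete, here is a minimal sketch of the elastic-net objective as a plain R function (the name `en_objective` is illustrative; in practice `glmnet` handles the fitting, and the intercept is usually left unpenalized): \(\alpha = 1\) gives the lasso penalty, \(\alpha = 0\) the ridge penalty.

```r
# Elastic-net objective: RSS plus the mixed L1/L2 penalty
# (illustrative sketch; every coefficient is penalized for simplicity)
en_objective <- function(beta, X, y, lambda, alpha) {
  rss <- sum((y - X %*% beta)^2)
  penalty <- lambda * (alpha * sum(abs(beta)) +
                       (1 - alpha) / 2 * sum(beta^2))
  rss + penalty
}
```

With `lambda = 0` the objective reduces to the OLS criterion, whatever the value of `alpha`.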
What would you use?
A tricky question… a simple linear fit does not seem to work.
\[ \begin{equation} \nonumber y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \end{equation} \]
Is this still linear regression?
Yes, it can still be written as
\[ \begin{equation} \nonumber \hat{Y} = X\beta \end{equation} \]
Where
\[ X= \begin{bmatrix} 1 & \textbf{X}_1 & \textbf{X}_1^2 \\ 1 & \textbf{X}_2 & \textbf{X}_2^2 \\ \vdots & \vdots & \vdots \\ 1 & \textbf{X}_n & \textbf{X}_n^2 \\ \end{bmatrix} \]
So… any regression is linear as long as it is linear in its parameters.
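A quick check of this point on simulated data: fitting the quadratic with `lm()` is identical to solving ordinary least squares on the design matrix \((1, x, x^2)\) by hand.

```r
# Quadratic regression is still linear in the parameters:
# lm() on (x, x^2) matches the explicit OLS solution
set.seed(1)
x <- runif(50)
y <- 1 + 2 * x - 3 * x^2 + rnorm(50, sd = 0.1)
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))  # convenience interface
X <- cbind(1, x, x^2)                       # explicit design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
```

The two sets of coefficients coincide up to numerical precision.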
plot(cbind(lidar$range, lidar$logratio), xlab="x", ylab="y", main="Lidar", pch=19, col="red")
pol <- lm(logratio ~ poly(range,2), data=lidar)
lines(cbind(lidar$range,predict(pol)), col="blue", lwd=2)
Still not very convincing…
Maybe…
Maybe too much…
If we want to stick to polynomials, how do we choose the degree?
This time the answer is the fifth!
Each method may work in specific circumstances, and you might want to have a look at each one before choosing the level of complexity of your curve.
But still… if you want high predictive power, then use cross-validation (almost no assumptions required)!
So we break the range of X into non-overlapping pieces and construct \(f_j(X) = I(o_j \le X \le t_j)\), an indicator function for a given sub-interval of X.
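A piecewise-constant fit along these lines can be sketched with `cut()`, which builds exactly these indicator functions (simulated data here, since the lidar set comes from an external package):

```r
# Piecewise-constant regression: one indicator column per bin of x,
# so the fitted value in each bin is simply the bin mean
set.seed(2)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
bins <- cut(x, breaks = quantile(x, 0:5 / 5), include.lowest = TRUE)
step_fit <- lm(y ~ bins)   # equivalent to fitting one mean per interval
```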
kn <- as.numeric( quantile(lidar$range, (1:4)/5) )
# prepare the matrix of covariates / explanatory variables
x <- matrix(0, dim(lidar)[1], length(kn)+1)
for(j in 1:length(kn)) {
x[,j] <- pmax(lidar$range-kn[j], 0)
}
x[, length(kn)+1] <- lidar$range
# Fit the regression model
ppm <- lm(lidar$logratio ~ x)
plot(logratio~range, data=lidar, pch=19, col='red', cex=1, main="lidar", xlab="x", ylab="y")
lines(predict(ppm)[order(range)]~sort(range), data=lidar, lwd=2, col='blue')
kn <- as.numeric( quantile(lidar$range, (1:4)/5) )
# prepare the matrix of covariates / explanatory variables
x <- matrix(0, dim(lidar)[1], length(kn)+2)
for(j in 1:length(kn)) {
x[,j] <- pmax(lidar$range-kn[j], 0)^2 ## <<<---- Note: here you put the degree
}
x[, length(kn)+1] <- lidar$range
x[, length(kn)+2] <- lidar$range^2 ## <<------ Note: also here(see formula)
# Fit the regression model
ppm <- lm(lidar$logratio ~ x)
plot(logratio~range, data=lidar, pch=19, col='red', cex=1, main="lidar", xlab="x", ylab="y")
lines(predict(ppm)[order(range)]~sort(range), data=lidar, lwd=2, col='blue')
Now it is also differentiable…
\[ \begin{equation} E[y_i|x_i] = \beta_0 + \beta_1 x_i + \sum_{j=1}^k \beta_{j+1} f_j(x_i) \end{equation} \]
Where
\[ \begin{equation} \nonumber f_j(x_i) = (x_i - k_j)_+ \end{equation} \]
\((x_i - k_j)_+ = x_i - k_j\) if positive, zero otherwise.
For a quadratic fit:
\[ \begin{equation} \nonumber E[y_i|x_i] = \beta_0 + \beta_1 x_i + \beta_2x_i^2 + \sum_{j=1}^k \beta_{j+2} f_j(x_i) \end{equation} \]
Where
\[ \begin{equation} \nonumber f_j(x_i) = (x_i - k_j)_+^2 \end{equation} \]
Generally for a p degree fit:
\[ \begin{equation} \nonumber E[y_i|x_i] = \beta_0 + \sum_{j=1}^p \beta_j x_i^j + \sum_{j=1}^k \beta_{j+p} f_j(x_i) \end{equation} \]
Where
\[ \begin{equation} f_j(x_i) = (x_i - k_j)_+^p \end{equation} \]
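A small helper that builds this degree-\(p\) truncated power basis (the function name `tp_basis` is illustrative):

```r
# Truncated power basis of degree p: columns x, ..., x^p plus
# one column (x - k_j)_+^p for each knot k_j
tp_basis <- function(x, knots, degree) {
  poly_part <- outer(x, 1:degree, `^`)
  spline_part <- sapply(knots, function(k) pmax(x - k, 0)^degree)
  cbind(poly_part, spline_part)
}
```

Feeding the resulting matrix to `lm()` reproduces the piecewise fits shown above.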
Issues
We will solve this problem using natural cubic splines. Natural cubic splines are linear beyond the boundary knots and have a continuous second-order derivative.
\[ \begin{equation} \underset{f}{\operatorname{min}} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f'' (t))^2 dt \end{equation} \]
The solution is a natural cubic spline!
Again, we should find the best trade-off between complexity and variance in order to minimize the out-of-sample MSPE. How do we choose \(\lambda\)?
… CV!
Whether you use k-fold CV or LOOCV may depend on several factors. In this specific case LOOCV is computationally less intensive: for a linear smoother it can be computed from a single fit using the formula
\[ \begin{equation} MSPE_{LOOCV}(\hat{Y}) = \frac{1}{N}\sum_{i=1}^N \left(\frac{y_i - \hat{f}(x_i)}{1 - (S_\lambda)_{ii}}\right)^2 \end{equation} \]
(replacing \((S_\lambda)_{ii}\) with \(trace(S_\lambda)/N\) gives the closely related GCV criterion).
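The shortcut can be verified directly: for any linear smoother the leave-one-out residual equals the ordinary residual divided by \(1 - S_{ii}\), so no refitting is needed. A check with plain `lm()`, where the \(S_{ii}\) are the hat values:

```r
# LOOCV for a linear smoother from a single fit:
# leave-one-out residual = residual / (1 - S_ii)
set.seed(3)
x <- runif(40)
y <- x + rnorm(40, sd = 0.3)
fit <- lm(y ~ x)
shortcut <- mean(((y - fitted(fit)) / (1 - hatvalues(fit)))^2)
# brute-force leave-one-out, for comparison
loo <- sapply(seq_along(y), function(i) {
  f <- lm(y ~ x, subset = -i)
  y[i] - predict(f, newdata = data.frame(x = x[i]))
})
brute <- mean(loo^2)
```

The two quantities agree exactly.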
In R, the built-in function smooth.spline (in the stats package) can perform this cross-validation directly.
tmp.cv <- smooth.spline(x=lidar$range, y=lidar$logratio, cv=TRUE,
all.knots=TRUE)
# optimal regularization parameter: tmp.cv$spar = 0.974
# compute the optimal fit
tmp <- smooth.spline(x=lidar$range, y=lidar$logratio, spar=tmp.cv$spar, cv=FALSE,
all.knots=TRUE)
plot(cbind(lidar$range, lidar$logratio), xlab="x", ylab="y", main="Lidar", pch=19, col="red")
lines(tmp, col="blue", lwd=2)
We are interested in predicting the nitrogen dioxide concentration using the weighted mean of distances to employment centers. To do so, we have real data collected in Boston at our disposal.
Open the Boston data set and draw a scatter plot of the variable nox as a function of the variable dis. What do you notice?
Hint: the skeleton of the code is provided for this exercise.
data(Boston, package="MASS")
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7, pch=19)
reg1 <- lm(nox~dis, data= Boston )
summary(reg1)
##
## Call:
## lm(formula = nox ~ dis, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12239 -0.05212 -0.01257 0.04391 0.23041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.715343 0.006796 105.26 <2e-16 ***
## dis -0.042331 0.001566 -27.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07412 on 504 degrees of freedom
## Multiple R-squared: 0.5917, Adjusted R-squared: 0.5909
## F-statistic: 730.4 on 1 and 504 DF, p-value: < 2.2e-16
s3 <- list()
for (i in c(2,3,8)) {
pol <- lm(nox ~ poly(dis,i), data=Boston)
s3[[i]] <- pol
}
summary(s3[[2]])
##
## Call:
## lm(formula = nox ~ poly(dis, i), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.129559 -0.044514 -0.007753 0.025778 0.201882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002828 196.16 <2e-16 ***
## poly(dis, i)1 -2.003096 0.063610 -31.49 <2e-16 ***
## poly(dis, i)2 0.856330 0.063610 13.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06361 on 503 degrees of freedom
## Multiple R-squared: 0.6999, Adjusted R-squared: 0.6987
## F-statistic: 586.4 on 2 and 503 DF, p-value: < 2.2e-16
summary(s3[[3]])
##
## Call:
## lm(formula = nox ~ poly(dis, i), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, i)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, i)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, i)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
summary(s3[[8]])
##
## Call:
## lm(formula = nox ~ poly(dis, i), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.132328 -0.037459 -0.008615 0.023667 0.197200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002702 205.312 < 2e-16 ***
## poly(dis, i)1 -2.003096 0.060774 -32.960 < 2e-16 ***
## poly(dis, i)2 0.856330 0.060774 14.091 < 2e-16 ***
## poly(dis, i)3 -0.318049 0.060774 -5.233 2.46e-07 ***
## poly(dis, i)4 0.033547 0.060774 0.552 0.58120
## poly(dis, i)5 0.133009 0.060774 2.189 0.02909 *
## poly(dis, i)6 -0.192439 0.060774 -3.166 0.00164 **
## poly(dis, i)7 0.169628 0.060774 2.791 0.00545 **
## poly(dis, i)8 -0.117703 0.060774 -1.937 0.05334 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06077 on 497 degrees of freedom
## Multiple R-squared: 0.7293, Adjusted R-squared: 0.7249
## F-statistic: 167.4 on 8 and 497 DF, p-value: < 2.2e-16
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
pch=19)
lines(predict(s3[[2]])[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue', lty="dotted")
lines(predict(s3[[3]])[order(dis)]~sort(dis), data=Boston, col="red", lwd=2, lty="dotted")
lines(predict(s3[[8]])[order(dis)]~sort(dis), data=Boston, col="black", lwd=2, lty="dotted")
legend("topright", legend=c("2 degree","3 degree", "8 degree"),
cex=0.7,
lwd=1, col=c("blue","red", "black"))
set.seed(123)
n <- length(Boston$nox)
k <- 5
ii <- sample(rep(1:k, length= n))
pr.2 <- pr.3 <- pr.8 <- rep(NA, length=n)
for (j in 1:k){
hold <- (ii == j)
train <- (ii != j)
xx.tr <- Boston[train,]
xx.te <- Boston[hold,]
pol2 <- lm(nox ~ poly(dis,2), data=xx.tr)
pol3 <- lm(nox ~ poly(dis,3), data=xx.tr)
pol8 <- lm(nox ~ poly(dis,8), data=xx.tr) ## <-- Remember to put data = xx.tr
pr.2[hold] <- predict(pol2, newdata=xx.te)
pr.3[hold] <- predict(pol3, newdata=xx.te)
pr.8[hold] <- predict(pol8, newdata=xx.te)
}
mspe2degree <- mean((pr.2-Boston$nox)^2)
mspe3degree <- mean((pr.3-Boston$nox)^2)
mspe8degree <- mean((pr.8-Boston$nox)^2)
print(cbind(mspe2degree, mspe3degree, mspe8degree))
## mspe2degree mspe3degree mspe8degree
## [1,] 0.004139044 0.003882791 0.01528747
## Smoothing spline with cross validation
tmp.cv <- smooth.spline(x=Boston$dis, y=Boston$nox, cv=TRUE,
all.knots=TRUE)
# compute the optimal fit
tmp <- smooth.spline(x=Boston$dis, y=Boston$nox, spar=tmp.cv$spar, cv=FALSE,
all.knots=TRUE)
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
pch=19)
lines(tmp, col="blue", lwd=2)
## Construct cubic splines and compare with cross validation
library(splines)
n <- nrow(Boston)
set.seed(123)
k <- 5
ii <- sample(rep(1:k, length= n))
pr.3.ns <- pr.5.ns <- pr.10.ns <- pr.40.ns <- rep(0, n)
for(j in 1:k) {
## Construct the splines (note that ns() creates a basis expansion)
reg3.ns <- lm(nox ~ ns(dis, df=3), data=Boston[ii != j,])
reg5.ns <- lm(nox ~ ns(dis, df=5), data=Boston[ii != j,])
reg10.ns <- lm(nox ~ ns(dis, df=10), data=Boston[ii != j,])
reg40.ns <- lm(nox ~ ns(dis, df=40), data=Boston[ii != j,])
## Make predictions
pr.3.ns[ ii == j ] <- predict(reg3.ns, newdata=Boston[ii==j,])
pr.5.ns[ ii == j ] <- predict(reg5.ns , newdata=Boston[ii==j,])
pr.10.ns[ ii == j ] <- predict(reg10.ns, newdata=Boston[ii==j,])
pr.40.ns[ ii == j ] <- predict(reg40.ns, newdata=Boston[ii==j,])
}
## Compute the MSPE
mspe3.ns <- mean( (Boston$nox - pr.3.ns)^2 )
mspe5.ns <- mean( (Boston$nox- pr.5.ns)^2 )
mspe10.ns <- mean( (Boston$nox - pr.10.ns)^2 )
mspe40.ns <- mean( (Boston$nox - pr.40.ns)^2 )
mspe.ns <- c(mspe3.ns,mspe5.ns,mspe10.ns,mspe40.ns)
## Look at it
print(mspe.ns)
## [1] 0.003876472 0.003761332 0.003731666 0.004123112
## Give an even better look
plot(c(3,5,10,40),mspe.ns, xlab="degrees of freedom", ylab="mspe", cex=1, pch=19, col="purple", main="Natural Splines")
lines(cbind(c(3,5,10,40),mspe.ns), lty="dotted", col="purple", lwd=1)
What if I predict the value of y using the mean of \(y_i\) at \(x_i\)?
\[ \begin{equation} \nonumber \hat{y} = average(y_i : x_i = x) \end{equation} \]
Or in a certain bandwidth?
\[ \begin{equation} \nonumber \hat{y} = average(y_i : x - 0.5 < x_i < x + 0.5) \end{equation} \]
For predicting values with dis = 2 and dis = 8, we get:
If we want to use the local averages:
\[ \begin{equation} \nonumber \hat{f}(x) =\frac{1}{N_x} \sum_{i=1}^n y_i I(|x_i - x| < h) \end{equation} \]
Where \(h\) is the half-width of the window (0.5 in the previous example, for a total width of 1), \(N_x\) is the number of observations in the given bandwidth, and \(I(\cdot)\) is an indicator function equal to one if the expression inside is true and zero otherwise.
The same formula can be written as:
\[ \begin{equation} \nonumber \hat{f}(x) =\frac{\sum_{i=1}^n y_i I(|x_i - x| < h)}{\sum_{i=1}^n I(|x_i - x| < h)} \end{equation} \]
And without loss of generality:
\[ \begin{equation} \nonumber \hat{f}(x) =\frac{\sum_{i=1}^n y_i K_h(x_i, x)}{\sum_{i=1}^n K_h(x_i, x)} \end{equation} \]
With
\[ \begin{equation} \nonumber K_h(x_i, x) = W(\small{|x_i - x|/h}) \end{equation} \]
\(W(z) = 1\) if \(z \le 1\), 0 otherwise.
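Written out directly, this box-kernel smoother is just a windowed mean (a sketch on simulated data; `box_smooth` is an illustrative name):

```r
# Local average with a box kernel: mean of the y_i with |x_i - x| < h
box_smooth <- function(x0, x, y, h) {
  in_window <- abs(x - x0) < h
  mean(y[in_window])
}
set.seed(4)
x <- runif(100)
y <- x^2 + rnorm(100, sd = 0.1)
fhat <- box_smooth(0.5, x, y, h = 0.1)  # estimate at x = 0.5
```

When `h` is large enough to cover all the data, the estimate collapses to the overall mean of `y`.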
## h = 0.2
ksm <- ksmooth(x=Boston$dis, y=Boston$nox, kernel="box", bandwidth=0.20)
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
pch=19)
lines(ksm, lwd=2, col="black")
What if we want a continuous function:
ksm <- ksmooth(x=Boston$dis, y=Boston$nox, kernel="box", bandwidth=1)
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
pch=19)
lines(ksm, lwd=2, col="black")
ksm <- ksmooth(x=Boston$dis, y=Boston$nox, kernel="normal", bandwidth=0.5)
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data", col="green", cex=0.7,
pch=19)
lines(ksm, lwd=2, col="black")
The intuition:
Let’s go back to:
\[ \begin{equation} \nonumber K_h(x_i, x) = W(\small{|x_i - x|/h}) \end{equation} \]
Now:
\(W(z) = 1 - z^2\) if \(z \le 1\), 0 otherwise (the Epanechnikov kernel, up to a normalizing constant).
What is the intuition behind this?
We can assign different weights according to the distance of each observation from the value of interest \(x\).
What we analyzed:
\[ \begin{equation} \nonumber \hat{f}(x) =\frac{\sum_{i=1}^n y_i K_h(x_i, x)}{\sum_{i=1}^n K_h(x_i, x)} \end{equation} \]
It corresponds to local averages. So, it is equivalent to:
\[ \begin{equation} \nonumber \hat{f}(x) = \underset{\mu}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\mu)^2 K_h(x_i, x) \end{equation} \]
Where the solution \(\mu\) is the weighted average in each bandwith for each \(x_i\).
An alternative way to compute a local fit is through local linear regression:
\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1(x_i - x))^2 K_h(x_i, x) \end{equation} \]
which reduces to the local average above when the slope term \(\beta_1\) is dropped:
\[ \begin{equation} \nonumber \hat{f}(x) = \underset{\mu}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\mu)^2 K_h(x_i, x) \end{equation} \]
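The equivalence is easy to check numerically: an intercept-only regression with kernel weights returns exactly the kernel-weighted average (a sketch with Gaussian weights on simulated data):

```r
# The local-constant fit is a kernel-weighted mean: an intercept-only
# weighted lm() recovers it exactly
set.seed(5)
x <- runif(60)
y <- sin(x) + rnorm(60, sd = 0.1)
x0 <- 0.5; h <- 0.2
w <- exp(-0.5 * ((x - x0) / h)^2)        # Gaussian kernel weights
mu_wls <- coef(lm(y ~ 1, weights = w))[[1]]
mu_avg <- weighted.mean(y, w)
```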
Remember in fact the Taylor expansion:
\[ f(x) = f(x_0) + f'(x_0) (x - x_0) + \frac{1}{2}f''(x_0) (x-x_0)^2 + R \]
which motivates a local quadratic fit:
\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, \beta_1, \beta_2}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1(x_i - x) - \beta_2(x_i - x)^2)^2 K_h(x_i, x) \end{equation} \]
Or more generally:
\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, ..., \beta_k}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1(x_i - x) - ... - \beta_k (x_i - x)^k)^2 K_h(x_i, x) \end{equation} \]
Note that the higher the degree, the lower the bias, but the higher the variance!
By assuming \(\hat{f}(x) = X\beta\) an alternative expression is:
\[ \begin{equation} \nonumber \hat{\beta}(x) = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\sum_{i=1}^n (y_i-\beta_0 - \beta_1x_i)^2 K_h(x_i, x) \end{equation} \]
Isn’t it still a local fit?
For each point we fit a linear regression, weighting differently the observations.
A compact solution is through weighted least squares:
\[ \hat{\beta}(\textbf{x}) =(X' W_\textbf{x} X)^{-1} (X'W_\textbf{x}Y), \forall \textbf{x} \]
\[ W_\textbf{x}= \begin{bmatrix} K_h(x_1,x) & 0 & ... & 0 \\ 0 & K_h(x_2,x) & ... & 0 \\ : & : & : & :\\ 0 & 0 &... & K_h(x_n,x) \\ \end{bmatrix} \]
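This closed form can be sketched in a few lines and compared against `lm()` with the `weights` argument (simulated data; a check at a single point \(x_0\)):

```r
# Local linear fit at x0 via the WLS closed form (X'WX)^{-1} X'Wy
set.seed(6)
x <- runif(80)
y <- cos(2 * x) + rnorm(80, sd = 0.1)
x0 <- 0.4; h <- 0.3
w <- exp(-0.5 * ((x - x0) / h)^2)         # Gaussian kernel weights
X <- cbind(1, x - x0)                     # local design matrix
W <- diag(w)
beta_wls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
fit_lm <- lm(y ~ I(x - x0), weights = w)  # the same fit via lm()
```

The intercept `beta_wls[1]` is the local fitted value \(\hat{f}(x_0)\).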
Is it a linear predictor?
It can still be written as
\[ \begin{equation} \nonumber \hat{f}(x) = S_h Y \end{equation} \]
Where \(S\) does not depend on \(Y\) and \(\hat{f}(x)\) is a weighted least squares estimator. Why do you think this property is important? …LOOCV for free?
Issues:
lp <- loess(nox ~ dis, data= Boston, span=0.50, degree=1, family="gaussian")
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data span = 0.5", col="green", cex=0.7,
pch=19)
lines(predict(lp)[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue')
lp <- loess(nox ~ dis, data= Boston, span=0.01, degree=1, family="gaussian")
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.7321
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.0112
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.00021904
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : k-d tree limited by memory. ncmax= 506
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data span = 0.01", col="green", cex=0.7,
pch=19)
lines(predict(lp)[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue')
R is complaining!!
lp <- loess(nox ~ dis, data= Boston, span=2, degree=2, family="gaussian")
plot(cbind(Boston$dis, Boston$nox), xlab="Distance from Emp.Center", ylab= "Nitrogen Dioxide", main ="Boston Data s = 2 d = 2", col="green", cex=0.7,
pch=19)
lines(predict(lp)[order(dis)]~sort(dis), data=Boston, lwd=2, col='blue')
Do you think local regression works well in multivariate cases?
Suppose that you have some observations uniformly distributed between 0 and 1, and you use a kernel smoother with a bandwidth of 0.5 and no covariates (only the dependent variable). How many dimensions do you have? What is your best estimator?
Now suppose that you use one single covariate (the two-dimensional case) with the same bandwidth of 0.5. Both y and x are uniformly distributed between 0 and 1. Do you expect to have the same number of observations within each bandwidth?
As the number of dimensions increases, the number of observations within each bandwidth decreases EXPONENTIALLY. This may hugely affect the predictive power of local regressions with either many variables or few observations.
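A quick simulation makes the point: with uniform covariates, the fraction of observations falling inside a window of half-width 0.25 around the centre roughly halves with every added dimension (the window size and dimensions here are illustrative):

```r
# Curse of dimensionality: fraction of uniform points inside a cube
# of half-width h around the centre, as the dimension d grows
set.seed(7)
frac_in_window <- function(d, n = 1e4, h = 0.25) {
  X <- matrix(runif(n * d), n, d)
  mean(apply(abs(X - 0.5) < h, 1, all))
}
fracs <- sapply(1:4, frac_in_window)  # roughly 0.5, 0.25, 0.125, 0.0625
```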
We are interested in predicting the number of fatalities controlling for the taxation of beer. The only objective is to have a strong predictor of the fatality rate. To do so, we have a data set with 336 observations and several social and geographical indicators. Load the data set fatalities and draw a scatter plot of fatalities as a function of the beer tax. What do you notice? What kind of regression would you use?
Hint: skeleton of the code is provided.
load("./data/fatalities.rda")
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
reg1 <- lm(fatal ~ beertax, data=fatalities)
abline(reg1)
summary(reg1)
##
## Call:
## lm(formula = fatal ~ beertax, data = fatalities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -851.4 -634.5 -230.0 137.2 4616.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 878.35 74.86 11.733 <2e-16 ***
## beertax 98.03 106.82 0.918 0.359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 934.3 on 334 degrees of freedom
## Multiple R-squared: 0.002515, Adjusted R-squared: -0.0004712
## F-statistic: 0.8422 on 1 and 334 DF, p-value: 0.3594
reg2 <- lm(fatal ~ poly(beertax,2), data=fatalities)
summary(reg2)
##
## Call:
## lm(formula = fatal ~ poly(beertax, 2), data = fatalities)
##
## Residuals:
## Min 1Q Median 3Q Max
## -903.5 -573.7 -244.6 121.6 4494.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 928.66 50.68 18.323 <2e-16 ***
## poly(beertax, 2)1 857.41 929.04 0.923 0.3567
## poly(beertax, 2)2 2030.04 929.04 2.185 0.0296 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 929 on 333 degrees of freedom
## Multiple R-squared: 0.01662, Adjusted R-squared: 0.01071
## F-statistic: 2.813 on 2 and 333 DF, p-value: 0.06144
library(KernSmooth)
ksm <- ksmooth(x=fatalities$beertax, y=fatalities$fatal, kernel="box", bandwidth=0.10)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(ksm, lwd=2, col="black")
ksm <- ksmooth(x=fatalities$beertax, y=fatalities$fatal, kernel="box", bandwidth=0.50)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(ksm, lwd=2, col="black")
ksm <- ksmooth(x=fatalities$beertax, y=fatalities$fatal, kernel="normal", bandwidth=0.10)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(ksm, lwd=2, col="black")
set.seed(123)
n <- length(fatalities$fatal)
k <- 20
ii <- sample(rep(1:k, length= n))
mspepol <- mspe01 <- mspe03 <- mspe05 <- mspe1 <- rep(NA, length = k)
for (j in 1:k){
hold <- (ii == j)
train <- (ii != j)
xx.tr <- fatalities[train,]
xx.te <- fatalities[hold,]
pol2 <- lm(fatal ~ poly(beertax,2), data=xx.tr)
ksm0.1 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=0.10)
ksm0.3 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=0.30)
ksm0.5 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=0.50)
ksm1 <- ksmooth(x=xx.tr$beertax, y=xx.tr$fatal, kernel="normal", bandwidth=1)
prpol <- predict(pol2, newdata=xx.te)
## predict() does not work with ksmooth objects; here is a possible approximation
pr.01 <- pr.03 <- pr.05 <- pr.1 <- NULL
for (i in xx.te$beertax){
pr.01 <- append(pr.01, ksm0.1$y[which.min(abs(ksm0.1$x - i))])
pr.03 <- append(pr.03,ksm0.3$y[which.min(abs(ksm0.3$x - i))])
pr.05 <- append(pr.05,ksm0.5$y[which.min(abs(ksm0.5$x - i))])
pr.1 <- append(pr.1,ksm1$y[which.min(abs(ksm1$x - i))])
}
mspepol[j] <- mean((xx.te$fatal - prpol)^2)
mspe01[j] <- mean((xx.te$fatal - pr.01)^2)
mspe03[j] <- mean((xx.te$fatal - pr.03)^2)
mspe05[j] <- mean((xx.te$fatal - pr.05)^2)
mspe1[j] <- mean((xx.te$fatal - pr.1)^2)
}
mspe <- cbind(mspepol, mspe01, mspe03, mspe05, mspe1)
colMeans(mspe)
## mspepol mspe01 mspe03 mspe05 mspe1
## 869388.8 685685.6 791856.2 819653.5 859488.6
library(MASS)
fatalities <- na.omit(fatalities)
# Best loess
n <- length(fatalities$fatal)
k <- 10
mspe.final <- NULL
ii <- (1:n) %% k + 1
set.seed(17)
for (j in 1:k){
hold <- (ii == j)
train <- (ii != j)
xx.tr <- fatalities[train,]
xx.te <- fatalities[hold,]
mspe <- NULL
for (i in seq(from=0.1,to=2,by=0.1)){
lp <- loess(fatal ~ beertax, data= xx.tr, span=i, degree=2, family="gaussian",control=loess.control(surface="direct"))
pr <- predict(lp, newdata=xx.te)
mspe <- append(mspe, mean((xx.te$fatal - pr)^2))
}
mspe.final <- cbind(mspe.final, mspe)
}
plot(cbind(seq(from=0.1,to=2,by=0.1), rowMeans(mspe.final)), type="n", main="fatal(beertax)", xlab="span", ylab="MSPE")
lines(cbind(seq(from=0.1,to=2,by=0.1), rowMeans(mspe.final)) ,col="blue", lwd=2)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
ppm <- loess(fatal ~ beertax, data= fatalities, span=0.1, degree=2, family="gaussian",control=loess.control(surface="direct"))
lines(predict(ppm)[order(beertax)]~sort(beertax), data=fatalities, lwd=2, col='blue')
## Smoothing Spline using LOOCV
fatalities <- na.omit(fatalities)
tmp.cv <- smooth.spline(x= fatalities$beertax, y=fatalities$fatal, cv=TRUE,
all.knots=TRUE)
# compute the optimal fit
tmp <- smooth.spline(x= fatalities$beertax, y=fatalities$fatal, cv=FALSE,
all.knots=TRUE, spar=tmp.cv$spar)
plot( fatal ~ beertax, data=fatalities, main="Fatalities Rate", col="purple")
lines(tmp, col="blue", lwd=2)
*Close to the smoother…*
## Build a lasso regression
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
lambdas.rr <- exp(seq(-10, 2, length=50))
xx <- fatalities[,-c(17,18,19,20,21,22,23,24,25,26)]
xx <- model.matrix(~., data=xx)[,-1] ## <- Use this as covariate matrix
y <- fatalities[,17]
k <- 10
p <- ncol(xx)
n <- nrow(xx)
xx <- scale(scale(xx), center=FALSE, scale=rep(sqrt(n-1), p))
lr.cv <- cv.glmnet(x=xx, y=y, lambda=lambdas.rr, nfolds=k, alpha=1)
plot(lr.cv)
lr.final <- glmnet(x=as.matrix(xx), y=as.vector(y), lambda=lr.cv$lambda.1se, alpha=1)
rownames(lr.final$beta)[which.max(abs(as.vector(lr.final$beta)))]
## [1] "milestot"
n <- length(fatalities$fatal)
k <- 10
mspe.final <- NULL
ii <- (1:n) %% k + 1
set.seed(17)
for (j in 1:k){
hold <- (ii == j)
train <- (ii != j)
xx.tr <- fatalities[train,]
xx.te <- fatalities[hold,]
mspe <- NULL
for (i in seq(from=0.1,to=2,by=0.1)){
lp <- loess(fatal ~ milestot, data= xx.tr, span=i, degree=2, family="gaussian",control=loess.control(surface="direct"))
pr <- predict(lp, newdata=xx.te)
mspe <- append(mspe, mean((xx.te$fatal - pr)^2))
}
mspe.final <- cbind(mspe.final, mspe)
}
rowMeans(mspe.final)
## [1] 35722.60 36314.22 36471.02 36529.62 36497.40 36427.97 36467.93
## [8] 36532.42 37108.30 39568.14 39675.23 39811.85 39942.90 40057.44
## [15] 40154.87 40237.26 40307.06 40366.52 40417.50 40461.51
## Best span: 0.1
n <- length(fatalities$fatal)
k <- 20
mspe.final <- NULL
ii <- (1:n) %% k + 1
set.seed(17)
xx <- fatalities[,-c(1,2,14,15,16,17,18,19,20,21,22,23,24,25,26)]
xx <- model.matrix(~., data=xx)[,-1] ## <- Use this as covariate matrix
y <- fatalities[,17]
pr <- rep(0, length= n)
for (j in 1:k){
hold <- (ii == j)
train <- (ii != j)
xx.tr <- xx[train,]
y.tr <- y[train]
xx.te <- xx[hold,]
y.te <- y[hold]
p <- ncol(xx.tr)
n <- nrow(xx.tr)
xxm <- scale(scale(xx.tr), center=FALSE, scale=rep(sqrt(n-1), p))
lr.cv <- cv.glmnet(x=xxm, y=y.tr, lambda=lambdas.rr, nfolds=k, alpha=1)
## Find the variable
lr.final <- glmnet(x=as.matrix(xxm), y=as.vector(y.tr), lambda=lr.cv$lambda.min, alpha=1)
## Use the variable to build your new data frame
dat.tr <- xx.tr[,which.max(abs(as.vector(lr.final$beta)))]
dat.te <- xx.te[,which.max(abs(as.vector(lr.final$beta)))]
dd.tr <- as.data.frame(cbind(dat.tr, y.tr))
dd.te <- as.data.frame(cbind(dat.te, y.te))
## Assign names to the data frame
names(dd.tr) <- names(dd.te) <- c("x", "y")
## Construct a local regression and predict
lp <- loess(y ~ x, data=dd.tr, span=0.1, degree=2, family="gaussian",control=loess.control(surface="direct"))
pr[hold] <- predict(lp, newdata=dd.te)
}
mean((y-pr)^2)
## [1] 36263.98